Understanding ggplot (continued)

We will continue to develop our understanding of ggplot. You should be familiar with the basics covered so far before proceeding.

When you practice coding, you will encounter a lot of errors. Error messages may seem mysterious, but they are not random. We have already seen the problems that arise when an aesthetic inside aes() is mistakenly given a constant value instead of being mapped to a variable.

p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp, color = "pink"))
p + geom_point() + geom_smooth(method = "loess") + scale_x_log10()
## `geom_smooth()` using formula 'y ~ x'
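
One way to see the difference between mapping and setting is to compare the colours ggplot actually computes. The sketch below uses a small made-up data frame (toy, a hypothetical stand-in for gapminder) and layer_data() to inspect the built plot. It is only an illustration of the idea, not the only way to check it.

```r
library(ggplot2)

# Hypothetical toy data standing in for gapminder (values are made up).
toy <- data.frame(gdpPercap = c(500, 1500, 4000, 12000, 30000),
                  lifeExp   = c(45, 55, 63, 72, 80))

# Mapped: inside aes(), "pink" is treated as a variable with one level,
# so ggplot adds a legend and picks the colour from its default palette.
p_mapped <- ggplot(toy, aes(x = gdpPercap, y = lifeExp, color = "pink")) +
  geom_point() + scale_x_log10()

# Set: outside aes(), "pink" is taken literally as the point colour.
p_set <- ggplot(toy, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(color = "pink") + scale_x_log10()

# Colours that end up in the built layer data.
mapped_colours <- unique(layer_data(p_mapped)$colour)
set_colours    <- unique(layer_data(p_set)$colour)
```

Only the second version gives pink points; the first gives a single default-palette colour plus a legend entry labelled "pink".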

In this lecture, we will discuss some useful features of ggplot that also commonly cause trouble.

Some keywords you need to know:

The ggplot library is an implementation of the grammar of graphics, an idea developed by Wilkinson (2005). It consists of several rules. If you break a rule, R throws an error and produces no result. For example, you might omit the + sign between the ggplot object and a geom_ function. This is usually called a syntax error, and it is easy to catch. Other times, you make a mistake in your code, but the code does not break any rules. Or you might supply the wrong information, for example, the wrong columns as inputs. How can we handle these cases?

Let’s see some common errors.

1. group argument in aes()

Go back to the gapminder dataset. Suppose I want to plot each country’s GDP per capita over time. We have the year, gdpPercap, and country variables, so running

p <- ggplot(data = gapminder, mapping = aes(x = year, y = gdpPercap))
p + geom_line() 

would provide the general trend line. Let’s see:

p <- ggplot(data = gapminder, mapping = aes(x = year, y = gdpPercap))
p + geom_line() 


You can guess what geom_line() does from:

p <- ggplot(data = gapminder, mapping = aes(x = year, y = gdpPercap))
p + geom_point() 

gapminder %>% arrange(year) %>% head(10)
## # A tibble: 10 × 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Albania     Europe     1952    55.2  1282697     1601.
##  3 Algeria     Africa     1952    43.1  9279525     2449.
##  4 Angola      Africa     1952    30.0  4232095     3521.
##  5 Argentina   Americas   1952    62.5 17876956     5911.
##  6 Australia   Oceania    1952    69.1  8691212    10040.
##  7 Austria     Europe     1952    66.8  6927772     6137.
##  8 Bahrain     Asia       1952    50.9   120447     9867.
##  9 Bangladesh  Asia       1952    37.5 46886859      684.
## 10 Belgium     Europe     1952    68    8730405     8343.


While ggplot will make a pretty good guess as to the structure of the data, it does not know that the yearly observations in the data are grouped by country. We have to tell it. In fact, geom_line() starts with the 1952 observation in the first row of the data and joins all the 1952 observations before moving on to the next year.

When you produce a plot but it looks weird, the problem is most likely in the mapping between the data and aesthetics for the geom_ being used.

In this case, we can use the group argument in aes() to tell ggplot explicitly about this structure.

p <- ggplot(data = gapminder, mapping = aes(x = year, y = gdpPercap))
p + geom_line(aes(group = country)) 


It looks rough, but you can see that each line represents one country’s GDP per capita over time.
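
To check what the group aesthetic changes, we can look at the group column in the built layer data. This is a minimal sketch with a made-up two-country data frame (toy is hypothetical, not part of gapminder).

```r
library(ggplot2)

# Hypothetical two-country panel (values are made up).
toy <- data.frame(
  country   = rep(c("A", "B"), each = 3),
  year      = rep(c(1952, 1957, 1962), times = 2),
  gdpPercap = c(800, 900, 1000, 5000, 5500, 6000)
)

p <- ggplot(toy, aes(x = year, y = gdpPercap))

# No group: all six rows form one path ordered by year, so A and B are joined.
one_path <- layer_data(p + geom_line())

# group = country: each country gets its own path.
two_paths <- layer_data(p + geom_line(aes(group = country)))

n_groups_default <- length(unique(one_path$group))
n_groups_country <- length(unique(two_paths$group))
```

Without the group mapping there is a single group, which is why the first gapminder plot zigzags between countries within each year.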

Think about what is happening here:

p <- ggplot(data = gapminder, mapping = aes(x = year, y = gdpPercap))
p + geom_line(aes(group = continent)) 

2. Facet: multi plots

The plot with one line per country is the one we want, but it is messy. Splitting it into multiple panels would help, because it lets a lot of information be presented compactly and comparably. This is called faceting the data by another variable. We have the continent variable, which splits the data into 5 groups.

The facet_wrap() function can take a series of arguments, but the most important is the first one.

p <- ggplot(data = gapminder, mapping = aes(x = year, y = gdpPercap))
p + geom_line(aes(group = country))  + facet_wrap(~ continent)


The tilde (~) is used to write formulas in R. In facet_wrap(), the formula usually has only one side. Most of the time, you will just want a single variable on the right side of the formula.

Each facet is labeled at the top. The overall layout minimizes the duplication of axis labels and other scales. In fact, we can add the features we have already learned to each facet. Let’s develop the plot:

p <- ggplot(data = gapminder, mapping = aes(x = year, y = gdpPercap))
p + geom_line(aes(group = country), color = "honeydew3")  + 
    facet_wrap(~ continent, ncol = 5)

# see where color argument is located!
# what's the role of ncol argument?


Add smoother

p + geom_line(aes(group = country), color = "honeydew3")  + 
    facet_wrap(~ continent, ncol = 5) +
    geom_smooth(size = 1 , method = "loess", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'

# check out the arguments in geom_smooth


Scale

p + geom_line(aes(group = country), color = "honeydew3")  + 
    facet_wrap(~ continent, ncol = 5) +
    geom_smooth(size = 1 , method = "loess", se = FALSE) +
    scale_y_log10(labels=scales::dollar)
## `geom_smooth()` using formula 'y ~ x'


Add labels

p + geom_line(aes(group = country), color = "honeydew3")  + 
    facet_wrap(~ continent, ncol = 3) +
    geom_smooth(size = 1 , method = "loess", se = FALSE) +
    scale_y_log10(labels=scales::dollar) +
    labs(x = "Year", y = "GDP per capita", title = "GDP per capita on Five Continents")
## `geom_smooth()` using formula 'y ~ x'

The facet_wrap() function is best used when you want a series of small multiples based on a single categorical variable. Your panels will be laid out in order and then wrapped into a grid.
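
The wrapping behaviour can be sketched with a made-up data frame that has five groups (toy and its columns g, x, y are hypothetical). The ncol argument only changes how the panels are arranged, not how many there are.

```r
library(ggplot2)

# Hypothetical data: five groups, two points each (values are made up).
toy <- data.frame(
  g = rep(c("a", "b", "c", "d", "e"), each = 2),
  x = rep(c(1, 2), times = 5),
  y = 1:10
)

p <- ggplot(toy, aes(x = x, y = y)) + geom_point()

# One panel per level of g. With ncol = 3, three panels fill the first row
# and the remaining two wrap onto a second row; with ncol = 5 all five
# panels sit in a single row.
p_wrapped <- p + facet_wrap(~ g, ncol = 3)
p_one_row <- p + facet_wrap(~ g, ncol = 5)

# Each row of the built layer data records which panel it is drawn in.
n_panels <- length(unique(layer_data(p_wrapped)$PANEL))
```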

facet_grid() might be useful when you want to facet your data by two or more categorical variables at once. See the Link. Let’s use another data file. The following command would throw an error on your machine: why?

setwd("~/Documents/ibs_course/BUS240/data")
load('gss_sm.rda')
head(gss_sm, 10)
## # A tibble: 10 × 32
##     year    id ballot   age childs sibs  degree race  sex   region incom…¹ relig
##    <dbl> <dbl> <labe> <dbl>  <dbl> <lab> <fct>  <fct> <fct> <fct>  <fct>   <fct>
##  1  2016     1 1         47      3 2     Bache… White Male  New E… $17000… None 
##  2  2016     2 2         61      0 3     High … White Male  New E… $50000… None 
##  3  2016     3 3         72      2 3     Bache… White Male  New E… $75000… Cath…
##  4  2016     4 1         43      4 3     High … White Fema… New E… $17000… Cath…
##  5  2016     5 3         55      2 2     Gradu… White Fema… New E… $17000… None 
##  6  2016     6 2         53      2 2     Junio… White Fema… New E… $60000… None 
##  7  2016     7 1         50      2 2     High … White Male  New E… $17000… None 
##  8  2016     8 3         23      3 6     High … Other Fema… Middl… $30000… Cath…
##  9  2016     9 1         45      3 5     High … Black Male  Middl… $60000… Prot…
## 10  2016    10 3         71      4 1     Junio… White Male  Middl… $60000… None 
## # … with 20 more variables: marital <fct>, padeg <fct>, madeg <fct>,
## #   partyid <fct>, polviews <fct>, happy <fct>, partners <fct>, grass <fct>,
## #   zodiac <fct>, pres12 <labelled>, wtssall <dbl>, income_rc <fct>,
## #   agegrp <fct>, ageq <fct>, siblings <fct>, kids <fct>, religion <fct>,
## #   bigregion <fct>, partners_rc <fct>, obama <dbl>, and abbreviated variable
## #   name ¹​income16

This is a sample of the 2016 General Social Survey. Compared to gapminder, this data set contains many categorical variables. Play around with it to see what information is available.

See the following:

p <- ggplot(data = gss_sm, mapping = aes(x = age, y = childs))
p + geom_point() 
## Warning: Removed 18 rows containing missing values (geom_point).

This plot shows the relationship between the age of the respondent and the number of children they have. We will then facet this relationship by sex and race of the respondent.

p <- ggplot(data = gss_sm, mapping = aes(x = age, y = childs))
p + geom_point()+facet_grid(sex ~ race)
## Warning: Removed 18 rows containing missing values (geom_point).
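
In facet_grid(), the variable on the left of the tilde defines the rows and the variable on the right defines the columns. The sketch below uses a made-up data frame (toy is hypothetical and only borrows the column names from gss_sm) to count the resulting panels.

```r
library(ggplot2)

# Hypothetical respondents covering every sex/race combination (values made up).
toy <- expand.grid(sex  = c("Male", "Female"),
                   race = c("White", "Black", "Other"),
                   age  = c(30, 60))
toy$childs <- seq_len(nrow(toy)) %% 4

# Left of ~ defines the rows, right of ~ defines the columns.
p <- ggplot(toy, aes(x = age, y = childs)) +
  geom_point() +
  facet_grid(sex ~ race)

# Two sexes times three races gives a 2 x 3 grid of panels.
n_panels <- length(unique(layer_data(p)$PANEL))
```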


Add details to the scatterplot

p <- ggplot(data = gss_sm, mapping = aes(x = age, y = childs))
p + geom_point(alpha = .3)+facet_grid(sex ~ race) + geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 18 rows containing non-finite values (stat_smooth).
## Warning: Removed 18 rows containing missing values (geom_point).

3. When Mapping is not clear

p <- ggplot(data = gss_sm, mapping = aes(x = bigregion))
p + geom_bar()

There is only one mapping here. Note the y-axis: count is not in the data set. In fact, geom_bar() calls stat_count() internally to calculate the number of observations for each value of x. This function also calculates the proportion.

p + geom_bar(mapping = aes(y = ..prop..))

Ignore the figure for now and just look at how we access the internal calculation. We need to ask for the prop statistic. When ggplot calculates the count or the proportion, it returns temporary variables that we can use as mappings in our plots. To make sure these temporary variables are not confused with others we are working with, their names are wrapped in two periods on each side, as in ..statistic.., so here we map y to ..prop.. inside aes().

But the figure still does not look right. This is because of grouping.

p + geom_bar(mapping = aes(y = ..prop.., group = 2))

We need to force ggplot to use the whole dataset, instead of each x category separately, as the denominator when calculating proportions. group = 2 is just a kind of “dummy group”. You can use any constant, for example group = 1 or group = "pink".
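
The effect of the dummy group can be checked numerically. This is a minimal sketch on a made-up six-row data frame (toy is hypothetical). It uses after_stat(prop), which newer ggplot2 releases recommend in place of ..prop..; the sketch assumes ggplot2 3.3.0 or later.

```r
library(ggplot2)

# Hypothetical data: 2 rows in A, 3 in B, 1 in C.
toy <- data.frame(region = c("A", "A", "B", "B", "B", "C"))
p <- ggplot(toy, aes(x = region))

# Default grouping: each bar is its own group, so every proportion is 1.
prop_default <- layer_data(p + geom_bar(aes(y = after_stat(prop))))$y

# Constant dummy group: proportions are taken over all six rows.
prop_overall <- layer_data(
  p + geom_bar(aes(y = after_stat(prop), group = 1))
)$y
```

With the dummy group the bar heights are 2/6, 3/6, and 1/6, which sum to 1 as proportions of the whole data should.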

Color?

p <- ggplot(data = gss_sm, mapping = aes(x = bigregion, color = bigregion))
p + geom_bar()

Mapping color only paints the outlines of the bars. Note that fill is for painting the insides of shapes (remember ribbons?).

p <- ggplot(data = gss_sm, mapping = aes(x = bigregion, fill = bigregion))
p + geom_bar()

4. More on fill

Take a look

table(gss_sm$region)
## 
##     New England Middle Atlantic E. Nor. Central W. Nor. Central  South Atlantic 
##             175             313             502             193             550 
## E. Sou. Central W. Sou. Central        Mountain         Pacific 
##             205             297             235             397

Suppose we want to look at religious preference by census region. You might recall the color argument in aes(). Good, but we need to use fill.

p <- ggplot(data = gss_sm, mapping = aes(x = bigregion, fill = religion))
p + geom_bar()

Note that region of the country is on the x-axis, and counts of religious preference are stacked within the bars.

To see the relative share:

p <- ggplot(data = gss_sm, mapping = aes(x = bigregion, fill = religion))
p + geom_bar(position = 'fill')

Note that we set the position argument of geom_bar() to "fill". This is not the same thing as the fill aesthetic in aes().

What if we want to show separate bars instead of stacked ones?

p <- ggplot(data = gss_sm, mapping = aes(x = bigregion, fill = religion))
p + geom_bar(position = 'dodge')
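
The three position settings differ only in where each bar segment is placed. A small sketch on made-up data (toy is hypothetical; it borrows only the column names used here) makes this visible in the built layer data.

```r
library(ggplot2)

# Hypothetical data: 3 rows in region N, 2 rows in region S.
toy <- data.frame(
  region   = c("N", "N", "N", "S", "S"),
  religion = c("a", "b", "b", "a", "a")
)
p <- ggplot(toy, aes(x = region, fill = religion))

stacked <- layer_data(p + geom_bar())                   # default is "stack"
filled  <- layer_data(p + geom_bar(position = "fill"))
dodged  <- layer_data(p + geom_bar(position = "dodge"))

top_stacked <- max(stacked$ymax)    # tallest stack = largest region count
top_filled  <- max(filled$ymax)     # every stack is rescaled to height 1
base_dodged <- unique(dodged$ymin)  # side-by-side bars all start at 0
```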

To show proportions instead of counts:

p + geom_bar(position = 'dodge',
             mapping = aes(y = ..prop..))

p + geom_bar(position = 'dodge',
             mapping = aes(y = ..prop.., group = religion))

When we just wanted the overall proportions for one variable, we mapped group to a constant (group = 2 above) to tell ggplot to calculate the proportions with respect to the overall N. In this case our grouping variable is religion, so we might try mapping that to the group aesthetic.

Or you can let the facet function do the work:

p <- ggplot(data = gss_sm, mapping = aes(x = religion))
p + geom_bar(position = "dodge", mapping = aes(y = ..prop.., group = bigregion)) +
    facet_wrap(~ bigregion, ncol = 2)

5. Do not confuse bar charts with histograms

What is a histogram?

head(midwest, 10)
## # A tibble: 10 × 28
##      PID county    state  area poptotal popden…¹ popwh…² popbl…³ popam…⁴ popas…⁵
##    <int> <chr>     <chr> <dbl>    <int>    <dbl>   <int>   <int>   <int>   <int>
##  1   561 ADAMS     IL    0.052    66090    1271.   63917    1702      98     249
##  2   562 ALEXANDER IL    0.014    10626     759     7054    3496      19      48
##  3   563 BOND      IL    0.022    14991     681.   14477     429      35      16
##  4   564 BOONE     IL    0.017    30806    1812.   29344     127      46     150
##  5   565 BROWN     IL    0.018     5836     324.    5264     547      14       5
##  6   566 BUREAU    IL    0.05     35688     714.   35157      50      65     195
##  7   567 CALHOUN   IL    0.017     5322     313.    5298       1       8      15
##  8   568 CARROLL   IL    0.027    16805     622.   16519     111      30      61
##  9   569 CASS      IL    0.024    13437     560.   13384      16       8      23
## 10   570 CHAMPAIGN IL    0.058   173025    2983.  146506   16559     331    8033
## # … with 18 more variables: popother <int>, percwhite <dbl>, percblack <dbl>,
## #   percamerindan <dbl>, percasian <dbl>, percother <dbl>, popadults <int>,
## #   perchsd <dbl>, percollege <dbl>, percprof <dbl>, poppovertyknown <int>,
## #   percpovertyknown <dbl>, percbelowpoverty <dbl>, percchildbelowpovert <dbl>,
## #   percadultpoverty <dbl>, percelderlypoverty <dbl>, inmetro <int>,
## #   category <chr>, and abbreviated variable names ¹​popdensity, ²​popwhite,
## #   ³​popblack, ⁴​popamerindian, ⁵​popasian

midwest is a dataset bundled with ggplot2. It contains information on counties in the Midwest.

p <- ggplot(data = midwest, mapping = aes(x = area))
p + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

p + geom_histogram(bins = 10)
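
The message above suggests choosing the binning yourself. As a rough sketch, bins fixes the number of bars while binwidth fixes their width; the value 0.01 below is an arbitrary illustrative choice, not a recommendation.

```r
library(ggplot2)  # also provides the midwest data

p <- ggplot(midwest, aes(x = area))

# bins fixes the number of bars; the bin width is derived from the data range.
h_bins <- layer_data(p + geom_histogram(bins = 10))

# binwidth fixes the width instead; the number of bars then follows from it.
h_width <- layer_data(p + geom_histogram(binwidth = 0.01))

n_bars      <- nrow(h_bins)
total_bins  <- sum(h_bins$count)
total_width <- sum(h_width$count)
```

Either way every county lands in exactly one bin, so the counts add up to the number of rows in midwest.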

While histograms summarize single variables, it’s also possible to use several at once to compare distributions. We can facet histograms by some variable of interest, or, as here, we can compare them in the same plot using the fill aesthetic.

two_states <- c("IL", "MI")

p <- ggplot(data = subset(midwest, subset = state %in% two_states),
            mapping = aes(x = percollege, fill = state))
p + geom_histogram(alpha = 0.4, bins = 20)

We subset the data here to pick out just two states, Illinois and Michigan. First we create a vector, two_states, holding their abbreviations. Then we use the subset() function to filter the data so that we only keep rows whose state is in this vector. The %in% operator is a convenient way to filter on more than one value of a variable when using subset().
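
The filtering step can be tried on its own, outside of any plot. The names keep and il_mi below are illustrative only.

```r
library(ggplot2)  # for the midwest data

two_states <- c("IL", "MI")

# %in% returns TRUE for each row whose state appears in two_states.
keep <- midwest$state %in% two_states

# subset() keeps only the rows where the condition is TRUE.
il_mi <- subset(midwest, subset = state %in% two_states)

states_kept <- sort(unique(il_mi$state))
```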

We can similarly draw a density plot.

p <- ggplot(data = midwest, mapping = aes(x = area))
p + geom_density()

p <- ggplot(data = midwest, mapping = aes(x = area, fill = state, color = state))
p + geom_density(alpha = 0.3)

p <- ggplot(data = subset(midwest, subset = state %in% two_states),
            mapping = aes(x = area, fill = state, color = state))
p + geom_density(alpha = 0.3, mapping = aes(y = ..scaled..))
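
The scaled statistic rescales each density curve so that its peak is 1, which makes curves for different groups easier to compare. A minimal sketch of this, written with after_stat(scaled) (the newer spelling of ..scaled.., assuming ggplot2 3.3.0 or later):

```r
library(ggplot2)

two_states <- c("IL", "MI")
d <- subset(midwest, subset = state %in% two_states)

p <- ggplot(d, aes(x = area, fill = state, color = state)) +
  geom_density(alpha = 0.3, mapping = aes(y = after_stat(scaled)))

# Peak height of each state's curve after scaling.
built <- layer_data(p)
peaks <- tapply(built$y, built$group, max)
```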

6. Plot summary table

setwd("~/Documents/ibs_course/BUS240/data")
load('titanic.rda')
head(titanic, 10)
##       fate    sex    n percent
## 1 perished   male 1364    62.0
## 2 perished female  126     5.7
## 3 survived   male  367    16.7
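## 4 survived female  344    15.6

p <- ggplot(data = titanic, mapping = aes(x = fate, y = percent, fill = sex))
# Without stat = "identity" this fails: geom_bar() defaults to stat_count(),
# which counts rows itself and so cannot be combined with a y aesthetic.
#p + geom_bar(position = "dodge")

# stat = "identity" tells geom_bar() to use the percent values as they are.
p + geom_bar(position = "dodge", stat = "identity")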
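
When the data are already summarized, geom_col() is a shorter way to get the same plot, since it is geom_bar() with stat = "identity" built in. The sketch below rebuilds the four-row table from the head(titanic) output above so that it runs without titanic.rda, and compares the two approaches.

```r
library(ggplot2)

# Rebuilt from the head(titanic) output shown above.
titanic <- data.frame(
  fate    = c("perished", "perished", "survived", "survived"),
  sex     = c("male", "female", "male", "female"),
  n       = c(1364, 126, 367, 344),
  percent = c(62.0, 5.7, 16.7, 15.6)
)

p <- ggplot(titanic, aes(x = fate, y = percent, fill = sex))

# stat = "identity": bar heights are the percent values themselves.
y_identity <- layer_data(p + geom_bar(position = "dodge", stat = "identity"))$y

# geom_col() gives the same bars without naming the stat.
y_col <- layer_data(p + geom_col(position = "dodge"))$y

# The default stat_count() refuses a y aesthetic, so building the plot errors.
count_attempt <- tryCatch(
  ggplot_build(p + geom_bar(position = "dodge")),
  error = function(e) e
)
```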